spherical image

Learning Spherical Convolution for Fast Features from 360° Imagery

Neural Information Processing Systems

While 360° cameras offer tremendous new possibilities in vision, graphics, and augmented reality, the spherical images they produce make core feature extraction non-trivial. Convolutional neural networks (CNNs) trained on images from perspective cameras yield "flat" filters, yet 360° images cannot be projected to a single plane without significant distortion. A naive solution that repeatedly projects the viewing sphere to all tangent planes is accurate, but much too computationally intensive for real problems. We propose to learn a spherical convolutional network that translates a planar CNN to process 360° imagery directly in its equirectangular projection. Our approach learns to reproduce the flat filter outputs on 360° data, sensitive to the varying distortion effects across the viewing sphere. The key benefits are 1) efficient feature extraction for 360° images and video, and 2) the ability to leverage powerful pre-trained networks researchers have carefully honed (together with massive labeled image training sets) for perspective images. We validate our approach against several alternative methods, both in terms of raw CNN output accuracy and by applying a state-of-the-art "flat" object detector to 360° data. Our method yields the most accurate results while saving orders of magnitude in computation versus the existing exact reprojection solution.
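To illustrate why distortion in an equirectangular image varies across the viewing sphere, here is a minimal Python sketch. It is not the paper's method; it only shows the standard equirectangular geometry, and the function names `pixel_to_lonlat` and `horizontal_stretch` are hypothetical helpers introduced for this example. The sketch assumes pixel row 0 is the top of the image and that longitude spans the full image width.

```python
import math


def pixel_to_lonlat(u, v, width, height):
    """Map the centre of pixel (u, v) to (longitude, latitude) in radians.

    Assumes an equirectangular image where columns span longitude
    [-pi, pi) and rows span latitude from +pi/2 (top) to -pi/2 (bottom).
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    return lon, lat


def horizontal_stretch(lat):
    """Horizontal stretch factor of the projection at latitude `lat`.

    A circle of latitude has circumference proportional to cos(lat), yet
    every image row has the same width, so content is stretched by
    1 / cos(lat). The factor grows without bound towards the poles.
    """
    return 1.0 / math.cos(lat)


if __name__ == "__main__":
    for degrees in (0, 30, 60, 80):
        factor = horizontal_stretch(math.radians(degrees))
        print(f"latitude {degrees:2d} deg: stretch x{factor:.2f}")
```

Because the stretch depends on the image row, a single planar filter cannot match perspective-image responses at every latitude, which is the motivation the abstract gives for distortion-aware filters.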


Geometry Fidelity for Spherical Images

arXiv.org Artificial Intelligence

Spherical or omni-directional images offer an immersive visual format appealing to a wide range of computer vision applications. However, geometric properties of spherical images pose a major challenge for models and metrics designed for ordinary 2D images. Here, we show that direct application of Fréchet Inception Distance (FID) is insufficient for quantifying geometric fidelity in spherical images. We introduce two quantitative metrics accounting for geometric constraints, namely Omnidirectional FID (OmniFID) and Discontinuity Score (DS). OmniFID is an extension of FID tailored to additionally capture field-of-view requirements of the spherical format by leveraging cubemap projections. DS is a kernel-based seam alignment score of continuity across borders of 2D representations of spherical images. In experiments, OmniFID and DS quantify geometry fidelity issues that are undetected by FID.
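For context on the baseline metric, the sketch below computes the standard Fréchet distance between two Gaussians, which is the quantity FID evaluates on Inception feature statistics. It does not reproduce OmniFID or DS, whose details are not given in the abstract. The function name `frechet_distance` is hypothetical, and the inputs are assumed to be feature means and covariance matrices that have already been estimated.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2).

    d^2 = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))
    """
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    sigma1 = np.asarray(sigma1, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)

    diff = mu1 - mu2
    # The trace of the matrix square root equals the sum of the square
    # roots of the eigenvalues of sigma1 @ sigma2, which are real and
    # non-negative for positive semi-definite covariances. Clipping
    # guards against small negative values from round-off.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * trace_sqrt)


if __name__ == "__main__":
    identity = np.eye(3)
    print(frechet_distance([0, 0, 0], identity, [1, 2, 0], identity))
```

The distance depends only on feature means and covariances, so it carries no explicit notion of image geometry; this is consistent with the abstract's claim that plain FID can miss geometric fidelity issues.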


3D Scene Geometry Estimation from 360° Imagery: A Survey

arXiv.org Artificial Intelligence

The world is three-dimensional (3D). As such, recovering 3D information about real-world objects enables many relevant applications, including self-driving cars [1, 2], robot navigation [3, 4], virtual tourism [5, 6], infrastructure inspection [7, 8], archaeological [9, 10] and architectural modeling [5, 11], city planning [12, 13], and 3D cinema [14, 15]. Many sensors can be used to obtain 3D data from real objects, such as light detection and ranging [16], structured light [17], and time of flight [18]. There is also a plethora of approaches for inferring 3D information from plain color images/videos. The widespread accessibility and low cost of consumer cameras are a strong motivation for the continued research efforts devoted to image-based 3D scene reconstruction methods [19]. In theory, 3D information can only be inferred from two or more captures of the scene, as in typical multi-view stereo [20] or structure from motion [21] approaches. However, recent approaches are exploring machine learning to perform single-image depth inference [22, 23, 24]. Most methods developed so far rely on traditional perspective/pinhole-based cameras, which have a narrow field of view (FoV) and thus might require thousands of captures to model large scenes [25, 26].


Attention-Enhanced Cross-modal Localization Between 360 Images and Point Clouds

arXiv.org Artificial Intelligence

Visual localization plays an important role for intelligent robots and autonomous driving, especially when the accuracy of GNSS is unreliable. Recently, camera localization in LiDAR maps has attracted growing attention for its low cost and potential robustness to illumination and weather changes. However, the commonly used pinhole camera has a narrow field of view, which limits the information it captures compared with omni-directional LiDAR data. To overcome this limitation, we focus on correlating the information of 360° equirectangular images to point clouds, proposing an end-to-end learnable network to conduct cross-modal visual localization by establishing similarity in high-dimensional feature space. Inspired by the attention mechanism, we optimize the network to capture the salient features for comparing images and point clouds. We construct several sequences containing 360° equirectangular images and corresponding point clouds based on the KITTI-360 dataset and conduct extensive experiments. The results demonstrate the effectiveness of our approach.

